Linux Basics

Biostat 203B

Author

Dr. Hua Zhou @ UCLA

Published

December 29, 2023

Display machine information for reproducibility:

sessionInfo()
R version 4.1.2 (2021-11-01)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.3 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
 [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
 [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
[10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] compiler_4.1.2  fastmap_1.1.1   cli_3.6.2       tools_4.1.2    
 [5] htmltools_0.5.7 yaml_2.3.8      rmarkdown_2.25  knitr_1.45     
 [9] jsonlite_1.8.8  xfun_0.41       digest_0.6.33   rlang_1.1.2    
[13] evaluate_0.23  

Windows Git Bash users need to set bash engine to Git Bash:

# only on Windows Git Bash
knitr::opts_chunk$set(engine.path = list(
  bash = "C:\\Program Files\\Git\\bin\\bash.exe"
))

1 Preface

  • This html is rendered from linux.qmd on Linux Ubuntu 22.04 (jammy).
    • Mac users can render linux.qmd directly. Some tools such as tree and locate need to be installed (follow the error messages).

    • Windows users need to install Git for Windows to render linux.qmd using Git Bash or install WSL (Windows Subsystem for Linux) to render linux.qmd using Ubuntu. Some tools such as tree and locate need to be installed (follow the error messages).

    • Both Mac and Windows users can also use Docker to render linux.qmd within a Ubuntu container.

  • In this lecture, most code chunks are bash commands instead of R code.

2 Why Linux

Linux is the most common platform for scientific computing and deployment of data science tools.

  • Open source and community support.

  • Things break; when they break using Linux, it’s easy to fix.

  • Scalability: portable devices (Android), laptops, servers, clusters, and supercomputers.

    • E.g. UCLA Hoffman2 cluster runs on Linux; most machines in the cloud (AWS, Azure, GCP) run on Linux.
  • Cost: it’s free!

3 Distributions of Linux

  • Debian/Ubuntu is a popular choice for personal computers.

  • RHEL/CentOS is popular on servers. (In December 2020, Red Hat terminated the development of CentOS Linux distribution.)

  • UCLA Hoffman2 cluster runs CentOS 7.9.2009 (as of 2023-01-01).

  • MacOS is derived from Unix (its Darwin kernel has BSD roots), not from Linux. It is POSIX compliant. Most shell commands we review here apply to the MacOS terminal as well. Windows/DOS, unfortunately, is a totally different breed.

  • Show operating system (OS) type:

echo $OSTYPE
linux-gnu
  • Show distribution/version on Linux:
# only on Linux terminal
cat /etc/*-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=22.04
DISTRIB_CODENAME=jammy
DISTRIB_DESCRIPTION="Ubuntu 22.04.3 LTS"
PRETTY_NAME="Ubuntu 22.04.3 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.3 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
  • Show distribution/version on MacOS:
# only on Mac terminal
sw_vers -productVersion

or

# only on Mac terminal
system_profiler SPSoftwareDataType
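
A script that must run on both Linux and MacOS can branch on the kernel name reported by uname -s. The following is a minimal sketch of that idea; the variable name os_family is illustrative and not part of the course materials.

```bash
# Classify the operating system from the kernel name (uname is POSIX).
case "$(uname -s)" in
  Linux)  os_family="linux" ;;   # e.g., Ubuntu, CentOS
  Darwin) os_family="macos" ;;   # MacOS reports its Darwin kernel
  *)      os_family="other" ;;   # e.g., Git Bash reports a MINGW... name
esac
echo "Detected OS family: $os_family"
```

The OS-specific commands shown above (cat /etc/*-release, sw_vers) could then be placed in the matching branch.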

4 Linux shells

4.1 Shells

  • A shell translates commands to OS instructions.

  • Most commonly used shells include bash, csh, tcsh, zsh, etc.

  • The default shell in MacOS changed from bash to zsh in MacOS v10.15 (Catalina).

  • Sometimes a command or a script does not run simply because it’s written for another shell.

  • We mostly use bash shell commands in this class.

  • Determine your login shell (usually also the current shell):

echo $SHELL
/bin/bash
  • List available shells (on Linux or MacOS):
cat /etc/shells
# /etc/shells: valid login shells
/bin/sh
/bin/bash
/usr/bin/bash
/bin/rbash
/usr/bin/rbash
/usr/bin/sh
/bin/dash
/usr/bin/dash
/usr/bin/tmux
/usr/bin/screen
  • Change to another shell:
```{bash}
#| eval: false
exec bash -l
```

The -l option indicates it should be a login shell.

  • Change your login shell permanently:
chsh -s /bin/bash [USERNAME]

Then log out and log in.
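
Because $SHELL records the login shell, it can differ from the shell that is actually interpreting a command or script. As a rough sketch (variable names are illustrative), the running shell can be guessed from the version variables that bash and zsh set for themselves:

```bash
# $SHELL is the login shell recorded for the user account.
login_shell="${SHELL:-unknown}"

# bash sets BASH_VERSION and zsh sets ZSH_VERSION; other shells set neither.
if [ -n "${BASH_VERSION:-}" ]; then
  running_shell="bash"
elif [ -n "${ZSH_VERSION:-}" ]; then
  running_shell="zsh"
else
  running_shell="other"
fi
echo "login shell: $login_shell; running shell: $running_shell"
```

This heuristic only distinguishes bash and zsh; it is an assumption that those two cover the cases of interest in this class.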

4.2 Command history and bash completion

We can navigate to previous/next commands with the up and down arrow keys, and list earlier commands with the history command. (pushd and popd, by contrast, maintain a stack of directories, not of commands.)

Bash provides the following standard completions by default. Far fewer typing errors and much less time!

  • Pathname completion.

  • Filename completion.

  • Variablename completion: echo $[TAB][TAB].

  • Username completion: cd ~[TAB][TAB].

  • Hostname completion: ssh huazhou@[TAB][TAB].

  • It can also be customized to auto-complete other stuff such as options and command’s arguments. Google bash completion for more information.

4.3 man is man’s best friend

Online help for shell commands: man [COMMANDNAME].

# display the first 30 lines of documentation for the ls command
man ls | head -30
LS(1)                            User Commands                           LS(1)

NAME
       ls - list directory contents

SYNOPSIS
       ls [OPTION]... [FILE]...

DESCRIPTION
       List  information  about  the FILEs (the current directory by default).
       Sort entries alphabetically if none of -cftuvSUX nor --sort  is  speci‐
       fied.

       Mandatory  arguments  to  long  options are mandatory for short options
       too.

       -a, --all
              do not ignore entries starting with .

       -A, --almost-all
              do not list implied . and ..

       --author
              with -l, print the author of each file

       -b, --escape
              print C-style escapes for nongraphic characters

       --block-size=SIZE
              with  -l,  scale  sizes  by  SIZE  when  printing  them;   e.g.,

6 Work with text files

6.1 View/peek text files

  • cat prints the contents of a file:
cat runSim.R
## parsing command arguments
for (arg in commandArgs(TRUE)) {
  eval(parse(text=arg))
}

## check if a given integer is prime
isPrime = function(n) {
  if (n <= 3) {
    return (TRUE)
  }
  if (any((n %% 2:floor(sqrt(n))) == 0)) {
    return (FALSE)
  }
  return (TRUE)
}

## estimate mean only using observation with prime indices
estMeanPrimes = function (x) {
  n = length(x)
  ind = sapply(1:n, isPrime)
  return (mean(x[ind]))
}

# simulate data
x = rnorm(n)

# estimate mean
estMeanPrimes(x)
  • head prints the first 10 lines of a file:
head runSim.R
## parsing command arguments
for (arg in commandArgs(TRUE)) {
  eval(parse(text=arg))
}

## check if a given integer is prime
isPrime = function(n) {
  if (n <= 3) {
    return (TRUE)
  }

head -l prints the first \(l\) lines of a file:

head -15 runSim.R
## parsing command arguments
for (arg in commandArgs(TRUE)) {
  eval(parse(text=arg))
}

## check if a given integer is prime
isPrime = function(n) {
  if (n <= 3) {
    return (TRUE)
  }
  if (any((n %% 2:floor(sqrt(n))) == 0)) {
    return (FALSE)
  }
  return (TRUE)
}
  • tail prints the last 10 lines of a file:
tail runSim.R
  n = length(x)
  ind = sapply(1:n, isPrime)
  return (mean(x[ind]))
}

# simulate data
x = rnorm(n)

# estimate mean
estMeanPrimes(x)

tail -l prints the last \(l\) lines of a file:

tail -15 runSim.R
  return (TRUE)
}

## estimate mean only using observation with prime indices
estMeanPrimes = function (x) {
  n = length(x)
  ind = sapply(1:n, isPrime)
  return (mean(x[ind]))
}

# simulate data
x = rnorm(n)

# estimate mean
estMeanPrimes(x)
  • Questions:
    • How to see the 11th line of the file and nothing else?
    • What about the 11th to the last line?
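
One possible answer to the questions above combines head, tail, and sed. The second question can be read two ways, so the sketch covers both. It builds a throwaway 25-line file (an assumption for illustration, not one of the course files) so that each result is easy to check by eye.

```bash
# Build a sample file whose k-th line reads "line k".
sample=$(mktemp)
awk 'BEGIN { for (i = 1; i <= 25; i++) print "line " i }' > "$sample"

# The 11th line only: keep the first 11 lines, then the last of those.
eleventh=$(head -n 11 "$sample" | tail -n 1)
# Same result with sed: -n suppresses default printing, 11p prints line 11.
eleventh_sed=$(sed -n '11p' "$sample")

# Reading 1: the 11th-to-last line.
eleventh_from_end=$(tail -n 11 "$sample" | head -n 1)
# Reading 2: everything from line 11 through the last line.
from_eleventh=$(tail -n +11 "$sample")

echo "$eleventh"            # line 11
echo "$eleventh_from_end"   # line 15
rm -f "$sample"
```

The same commands apply to runSim.R by replacing "$sample" with the file name.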

6.2 Piping and redirection

  • | sends output from one command as input of another command.
ls -l | head -5
total 6876
-rw-r--r-- 1 huazhou huazhou  110345 Dec 28 21:08 Emacs_Reference_Card.pdf
-rw-r--r-- 1 huazhou huazhou  157353 Dec 28 21:08 IDRE_Winter_2019_Workshops.pdf
-rw-r--r-- 1 huazhou huazhou  141962 Dec 28 21:08 Richard_Stallman_2013.png
-rw-r--r-- 1 huazhou huazhou  199492 Dec 28 21:08 Vi_Cheat_Sheet.pdf
  • > directs output from one command to a file.

  • >> appends output from one command to a file.

  • < reads input from a file.

  • Combinations of shell commands (grep, sed, awk, …), piping and redirection, and regular expressions allow us to pre-process and reformat huge text files efficiently.

  • See HW1.
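
The four operators above can be seen together in a small sketch. The file fruits.txt and the temporary directory are made up for the demonstration.

```bash
workdir=$(mktemp -d)

# > creates the file (or overwrites it if it already exists).
printf 'banana\napple\n' > "$workdir/fruits.txt"
# >> appends to the end of the file.
printf 'cherry\n' >> "$workdir/fruits.txt"

# < feeds the file to sort as standard input; | passes sorted lines to head.
first_sorted=$(sort < "$workdir/fruits.txt" | head -n 1)
# Reading via < makes wc print the count without a file name.
n_lines=$(wc -l < "$workdir/fruits.txt" | tr -d ' ')

echo "$first_sorted"   # apple
echo "$n_lines"        # 3
rm -r "$workdir"
```

Note that > silently discards the previous contents of an existing file, so >> is the safer choice when accumulating results.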

6.3 less is more; more is less

  • more browses a text file screen by screen (only downwards). Scroll down one page (paging) by pressing the spacebar; exit by pressing the q key.

  • less is also a pager, but has more functionalities, e.g., scroll upwards and downwards through the input.

  • less doesn’t need to read the whole file, i.e., it loads files faster than more.

6.4 grep

grep prints lines that match an expression:

  • Show lines that contain string CentOS:
# quotes not necessary if not a regular expression
grep 'CentOS' linux.qmd
- RHEL/CentOS is popular on servers. (In December 2020, Red Hat terminated the development of CentOS Linux distribution.)
- UCLA Hoffman2 cluster runs CentOS 7.9.2009 (as of 2023-01-01).
- Show lines that contain string `CentOS`:
grep 'CentOS' linux.qmd
grep 'CentOS' *.qmd
grep -n 'CentOS' linux.qmd
- Replace `CentOS` by `RHEL` in a text file:
sed 's/CentOS/RHEL/' linux.qmd | grep RHEL
  • Search multiple text files:
grep 'CentOS' *.qmd
- RHEL/CentOS is popular on servers. (In December 2020, Red Hat terminated the development of CentOS Linux distribution.)
- UCLA Hoffman2 cluster runs CentOS 7.9.2009 (as of 2023-01-01).
- Show lines that contain string `CentOS`:
grep 'CentOS' linux.qmd
grep 'CentOS' *.qmd
grep -n 'CentOS' linux.qmd
- Replace `CentOS` by `RHEL` in a text file:
sed 's/CentOS/RHEL/' linux.qmd | grep RHEL
  • Show matching line numbers:
grep -n 'CentOS' linux.qmd
58:- RHEL/CentOS is popular on servers. (In December 2020, Red Hat terminated the development of CentOS Linux distribution.)
64:- UCLA Hoffman2 cluster runs CentOS 7.9.2009 (as of 2023-01-01).
361:- Show lines that contain string `CentOS`:
364:grep 'CentOS' linux.qmd
369:grep 'CentOS' *.qmd
374:grep -n 'CentOS' linux.qmd
391:- Replace `CentOS` by `RHEL` in a text file:
393:sed 's/CentOS/RHEL/' linux.qmd | grep RHEL
  • Find all files in current directory with .png extension:
ls | grep '\.png$'
Richard_Stallman_2013.png
key_authentication_1.png
key_authentication_2.png
linux_directory_structure.png
linux_filepermission.png
linux_filepermission_oct.png
redhat_kills_centos.png
screenshot_top.png
  • Find all directories in the current directory:
ls -al | grep '^d'
drwxr-xr-x 2 huazhou huazhou    4096 Dec 29 00:45 .
drwxr-xr-x 4 huazhou huazhou    4096 Dec 28 21:08 ..
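
In a regular expression an unescaped . matches any single character, so a pattern meant to match a file extension should escape it as \. to match a literal period. The sketch below uses a made-up list of file names to show the difference, along with the -c (count) and -i (ignore case) options of grep.

```bash
names=$(printf '%s\n' plot.png notes_png sprite.PNG readme.md)

# Unescaped dot matches any character, so notes_png also matches.
loose=$(printf '%s\n' "$names" | grep -c '.png$')
# Escaped dot matches a literal period only.
strict=$(printf '%s\n' "$names" | grep -c '\.png$')
# -i ignores case, so sprite.PNG matches as well.
any_case=$(printf '%s\n' "$names" | grep -ci '\.png$')

echo "$loose $strict $any_case"   # 2 1 2
```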

6.5 sed

  • sed is a stream editor.

  • Replace CentOS by RHEL in a text file:

sed 's/CentOS/RHEL/' linux.qmd | grep RHEL
- RHEL/RHEL is popular on servers. (In December 2020, Red Hat terminated the development of CentOS Linux distribution.)
- UCLA Hoffman2 cluster runs RHEL 7.9.2009 (as of 2023-01-01).
- Show lines that contain string `RHEL`:
grep 'RHEL' linux.qmd
grep 'RHEL' *.qmd
grep -n 'RHEL' linux.qmd
- Replace `RHEL` by `RHEL` in a text file:
sed 's/RHEL/RHEL/' linux.qmd | grep RHEL
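
The first output line above still contains one CentOS because s/old/new/ replaces only the first match on each line. Appending the g flag replaces every match. A small sketch on a made-up sentence:

```bash
line='RHEL/CentOS is popular; CentOS Linux was discontinued.'

# Without g: only the first CentOS on the line is replaced.
first_only=$(printf '%s\n' "$line" | sed 's/CentOS/RHEL/')
# With g: every CentOS on the line is replaced.
all_matches=$(printf '%s\n' "$line" | sed 's/CentOS/RHEL/g')

echo "$first_only"    # RHEL/RHEL is popular; CentOS Linux was discontinued.
echo "$all_matches"   # RHEL/RHEL is popular; RHEL Linux was discontinued.
```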

6.6 awk

  • awk is a filter and report writer.

  • First let’s display the content of the file /etc/passwd (this file only exists in Linux and MacOS):

cat /etc/passwd
root:x:0:0:root:/root:/bin/bash
daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin
bin:x:2:2:bin:/bin:/usr/sbin/nologin
sys:x:3:3:sys:/dev:/usr/sbin/nologin
sync:x:4:65534:sync:/bin:/bin/sync
games:x:5:60:games:/usr/games:/usr/sbin/nologin
man:x:6:12:man:/var/cache/man:/usr/sbin/nologin
lp:x:7:7:lp:/var/spool/lpd:/usr/sbin/nologin
mail:x:8:8:mail:/var/mail:/usr/sbin/nologin
news:x:9:9:news:/var/spool/news:/usr/sbin/nologin
uucp:x:10:10:uucp:/var/spool/uucp:/usr/sbin/nologin
proxy:x:13:13:proxy:/bin:/usr/sbin/nologin
www-data:x:33:33:www-data:/var/www:/usr/sbin/nologin
backup:x:34:34:backup:/var/backups:/usr/sbin/nologin
list:x:38:38:Mailing List Manager:/var/list:/usr/sbin/nologin
irc:x:39:39:ircd:/run/ircd:/usr/sbin/nologin
gnats:x:41:41:Gnats Bug-Reporting System (admin):/var/lib/gnats:/usr/sbin/nologin
nobody:x:65534:65534:nobody:/nonexistent:/usr/sbin/nologin
systemd-network:x:100:102:systemd Network Management,,,:/run/systemd:/usr/sbin/nologin
systemd-resolve:x:101:103:systemd Resolver,,,:/run/systemd:/usr/sbin/nologin
messagebus:x:102:105::/nonexistent:/usr/sbin/nologin
systemd-timesync:x:103:106:systemd Time Synchronization,,,:/run/systemd:/usr/sbin/nologin
syslog:x:104:111::/home/syslog:/usr/sbin/nologin
_apt:x:105:65534::/nonexistent:/usr/sbin/nologin
uuidd:x:106:112::/run/uuidd:/usr/sbin/nologin
tcpdump:x:107:113::/nonexistent:/usr/sbin/nologin
huazhou:x:1000:1000:,,,:/home/huazhou:/bin/bash
rstudio-server:x:999:999::/home/rstudio-server:/bin/sh

Each line contains fields (1) user name, (2) password, (3) user ID, (4) group ID, (5) user ID info, (6) home directory, and (7) command shell, separated by :.

  • Print sorted list of login names:
awk -F: '{ print $1 }' /etc/passwd | sort | head -10
_apt
backup
bin
daemon
games
gnats
huazhou
irc
list
lp
  • Print number of lines in a file, as NR stands for Number of Records (by default each line is one record):
awk 'END { print NR }' /etc/passwd
28

or

wc -l /etc/passwd
28 /etc/passwd

or (not displaying file name)

wc -l < /etc/passwd
28
  • Print entries (whole lines) with UID in the range 1000-1047:
awk -F: '{if ($3 >= 1000 && $3 <= 1047) print}' /etc/passwd
huazhou:x:1000:1000:,,,:/home/huazhou:/bin/bash
  • Print login names and log-in shells in comma-separated format:
awk -F: '{OFS = ","} {print $1, $7}' /etc/passwd
root,/bin/bash
daemon,/usr/sbin/nologin
bin,/usr/sbin/nologin
sys,/usr/sbin/nologin
sync,/bin/sync
games,/usr/sbin/nologin
man,/usr/sbin/nologin
lp,/usr/sbin/nologin
mail,/usr/sbin/nologin
news,/usr/sbin/nologin
uucp,/usr/sbin/nologin
proxy,/usr/sbin/nologin
www-data,/usr/sbin/nologin
backup,/usr/sbin/nologin
list,/usr/sbin/nologin
irc,/usr/sbin/nologin
gnats,/usr/sbin/nologin
nobody,/usr/sbin/nologin
systemd-network,/usr/sbin/nologin
systemd-resolve,/usr/sbin/nologin
messagebus,/usr/sbin/nologin
systemd-timesync,/usr/sbin/nologin
syslog,/usr/sbin/nologin
_apt,/usr/sbin/nologin
uuidd,/usr/sbin/nologin
tcpdump,/usr/sbin/nologin
huazhou,/bin/bash
rstudio-server,/bin/sh
  • Print login names and indicate those with UID>=1000 as vip:
awk -F: -v status="" '{OFS = ","} 
{if ($3 >= 1000) status="vip"; else status="regular"} 
{print $1, status}' /etc/passwd
root,regular
daemon,regular
bin,regular
sys,regular
sync,regular
games,regular
man,regular
lp,regular
mail,regular
news,regular
uucp,regular
proxy,regular
www-data,regular
backup,regular
list,regular
irc,regular
gnats,regular
nobody,vip
systemd-network,regular
systemd-resolve,regular
messagebus,regular
systemd-timesync,regular
syslog,regular
_apt,regular
uuidd,regular
tcpdump,regular
huazhou,vip
rstudio-server,regular
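
The awk programs above all follow the form pattern { action }. As a sketch of two further idioms, the example below sets the field separators once in a BEGIN block and counts users per login shell with an associative array. It runs on a small made-up file in /etc/passwd format (the users alice and bob are hypothetical) so that the result does not depend on the machine.

```bash
sample=$(mktemp)
cat > "$sample" <<'EOF'
root:x:0:0:root:/root:/bin/bash
daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin
alice:x:1000:1000:,,,:/home/alice:/bin/bash
bob:x:1001:1001:,,,:/home/bob:/bin/zsh
EOF

# BEGIN runs once before any input; the pattern $3 >= 1000 selects the lines.
report=$(awk 'BEGIN { FS = ":"; OFS = "," } $3 >= 1000 { print $1, $7 }' "$sample")

# count[] is indexed by shell path; END runs after the last line is read.
shell_counts=$(awk -F: '{ count[$7]++ } END { for (s in count) print s, count[s] }' "$sample" | sort)

echo "$report"
echo "$shell_counts"
rm -f "$sample"
```

The order of for (s in count) is unspecified in awk, which is why the output is piped through sort.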

6.7 Text editors

Source: Editor War on Wikipedia.

6.7.1 Emacs

  • Emacs is a powerful text editor with extensive support for many languages including R, \(\LaTeX\), python, and C/C++; however it’s not installed by default on many Linux distributions.

  • Basic survival commands:

    • emacs filename to open a file with emacs.
    • CTRL-x CTRL-f to open an existing or new file.
    • CTRL-x CTRL-s to save.
    • CTRL-x CTRL-w to save as.
    • CTRL-x CTRL-c to quit.
  • Google emacs cheatsheet

C-<key> means hold the control key, and press <key>.
M-<key> means press the Esc key once, and press <key>.

6.7.2 Vi

  • Vi is ubiquitous (POSIX standard). Learn at least its basics; otherwise you can edit nothing on some clusters.

  • Basic survival commands:

    • vi filename to start editing a file.
    • vi is a modal editor: insert mode and normal mode. Pressing i switches from the normal mode to insert mode. Pressing ESC switches from the insert mode to normal mode.
    • :x<Return> quits vi and saves changes.
    • :q!<Return> quits vi without saving latest changes.
    • :w<Return> saves changes.
    • :wq<Return> quits vi and saves changes.
  • Google vi cheatsheet

7 IDE (Integrated Development Environment)

  • Statisticians/data scientists write a lot of code. Critical to adopt a good IDE that goes beyond code editing: syntax highlighting, executing code within editor, debugging, profiling, version control, etc.

  • RStudio, Eclipse, Emacs, Matlab, Visual Studio, VS Code, etc.

8 Processes

8.1 Cancel a non-responding program

  • Press Ctrl+C to cancel a non-responding or long-running program.

8.2 Processes

  • OS runs processes on behalf of user.

  • Each process has a process ID (PID), user name (UID), parent process ID (PPID), start time and date (STIME), accumulated CPU time (TIME), etc.

ps
    PID TTY          TIME CMD
    410 ?        00:00:00 systemd
    411 ?        00:00:00 (sd-pam)
    451 ?        00:00:12 rsession
   2974 ?        00:00:00 quarto
   2982 ?        00:00:00 deno
   3002 ?        00:00:00 R
   3121 ?        00:00:00 sh
   3122 ?        00:00:00 ps
  • All current running processes:
ps -eaf
UID          PID    PPID  C STIME TTY          TIME CMD
root           1       0  0 Dec28 ?        00:00:00 /sbin/init
root           2       1  0 Dec28 ?        00:00:00 /init
root           5       2  0 Dec28 ?        00:00:00 plan9 --control-socket 6 --log-level 4 --server-fd 7 --pipe-fd 9 --log-truncate
root          40       1  0 Dec28 ?        00:00:00 /lib/systemd/systemd-journald
root          59       1  0 Dec28 ?        00:00:01 /lib/systemd/systemd-udevd
root          74       1  0 Dec28 ?        00:00:00 snapfuse /var/lib/snapd/snaps/bare_5.snap /snap/bare/5 -o ro,nodev,allow_other,suid
root          76       1  0 Dec28 ?        00:00:00 snapfuse /var/lib/snapd/snaps/core22_864.snap /snap/core22/864 -o ro,nodev,allow_other,suid
root          77       1  0 Dec28 ?        00:00:00 snapfuse /var/lib/snapd/snaps/gtk-common-themes_1535.snap /snap/gtk-common-themes/1535 -o ro,nodev,allow_other,suid
root          78       1  0 Dec28 ?        00:00:01 snapfuse /var/lib/snapd/snaps/snapd_20290.snap /snap/snapd/20290 -o ro,nodev,allow_other,suid
root          81       1  0 Dec28 ?        00:00:00 snapfuse /var/lib/snapd/snaps/ubuntu-desktop-installer_1276.snap /snap/ubuntu-desktop-installer/1276 -o ro,nodev,allow_other,suid
systemd+      88       1  0 Dec28 ?        00:00:00 /lib/systemd/systemd-resolved
root         125       1  0 Dec28 ?        00:00:00 /usr/sbin/cron -f -P
message+     126       1  0 Dec28 ?        00:00:00 @dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation --syslog-only
root         135       1  0 Dec28 ?        00:00:00 /usr/bin/python3 /usr/bin/networkd-dispatcher --run-startup-triggers
syslog       137       1  0 Dec28 ?        00:00:00 /usr/sbin/rsyslogd -n -iNONE
root         139       1  0 Dec28 ?        00:00:03 /usr/lib/snapd/snapd
root         140       1  0 Dec28 ?        00:00:00 /lib/systemd/systemd-logind
root         182       1  0 Dec28 ?        00:00:00 /usr/bin/python3 /usr/share/unattended-upgrades/unattended-upgrade-shutdown --wait-for-signal
root         187       1  0 Dec28 hvc0     00:00:00 /sbin/agetty -o -p -- \u --noclear --keep-baud console 115200,38400,9600 vt220
root         196       1  0 Dec28 tty1     00:00:00 /sbin/agetty -o -p -- \u --noclear tty1 linux
rstudio+     215       1  0 Dec28 ?        00:00:04 /usr/lib/rstudio-server/bin/rserver
root         373       2  0 Dec28 ?        00:00:00 /init
root         374     373  0 Dec28 ?        00:00:00 /init
huazhou      375     374  0 Dec28 pts/0    00:00:00 -bash
root         376       2  0 Dec28 pts/1    00:00:00 /bin/login -f
huazhou      410       1  0 Dec28 ?        00:00:00 /lib/systemd/systemd --user
huazhou      411     410  0 Dec28 ?        00:00:00 (sd-pam)
huazhou      416     376  0 Dec28 pts/1    00:00:00 -bash
huazhou      451     215  0 Dec28 ?        00:00:12 /usr/lib/rstudio-server/bin/rsession -u huazhou --session-use-secure-cookies 0 --session-root-path / --session-same-site 0 --session-use-file-storage 1 --launcher-token ACC6AAAA --r-restore-workspace 2 --r-run-rprofile 2
root         811       1  0 Dec28 ?        00:00:00 snapfuse /var/lib/snapd/snaps/core22_1033.snap /snap/core22/1033 -o ro,nodev,allow_other,suid
root        1070       1  0 Dec28 ?        00:00:00 snapfuse /var/lib/snapd/snaps/ubuntu-desktop-installer_1278.snap /snap/ubuntu-desktop-installer/1278 -o ro,nodev,allow_other,suid
root        1143       1  0 Dec28 ?        00:00:00 /bin/bash /snap/ubuntu-desktop-installer/1278/bin/subiquity-server
root        1168    1143  0 Dec28 ?        00:00:03 /snap/ubuntu-desktop-installer/1278/usr/bin/python3.10 -m subiquity.cmd.server --use-os-prober --storage-version=2 --postinst-hooks-dir=/snap/ubuntu-desktop-installer/1278/etc/subiquity/postinst.d
root        1172    1168  0 Dec28 ?        00:00:04 python3 /snap/ubuntu-desktop-installer/1278/usr/bin/cloud-init status --wait
huazhou     2185     451  0 00:23 pts/2    00:00:00 bash --login --posix
root        2972      59  0 00:45 ?        00:00:00 /lib/systemd/systemd-udevd
root        2973      59  0 00:45 ?        00:00:00 /lib/systemd/systemd-udevd
huazhou     2974     451  0 00:45 ?        00:00:00 /bin/bash /opt/quarto/bin/quarto preview linux.qmd --to html --no-watch-inputs --no-browse
huazhou     2982    2974 71 00:45 ?        00:00:00 /opt/quarto/bin/tools/deno-x86_64-unknown-linux-gnu/deno run --unstable --no-config --cached-only --allow-read --allow-write --allow-run --allow-env --allow-net --allow-ffi --no-check --importmap=/opt/quarto/bin/vendor/import_map.json /opt/quarto/bin/quarto.js preview linux.qmd --to html --no-watch-inputs --no-browse
root        3000      59  0 00:45 ?        00:00:00 /lib/systemd/systemd-udevd
root        3001      59  0 00:45 ?        00:00:00 /lib/systemd/systemd-udevd
huazhou     3002    2982  0 00:45 ?        00:00:00 /usr/lib/R/bin/exec/R --no-echo --no-restore --file=/opt/quarto/share/rmd/rmd.R
huazhou     3123    3002  0 00:45 ?        00:00:00 sh -c 'bash'  -c 'ps -eaf' 2>&1
huazhou     3124    3123  0 00:45 ?        00:00:00 ps -eaf
  • All Python processes:
ps -eaf | grep python
root         135       1  0 Dec28 ?        00:00:00 /usr/bin/python3 /usr/bin/networkd-dispatcher --run-startup-triggers
root         182       1  0 Dec28 ?        00:00:00 /usr/bin/python3 /usr/share/unattended-upgrades/unattended-upgrade-shutdown --wait-for-signal
root        1168    1143  0 Dec28 ?        00:00:03 /snap/ubuntu-desktop-installer/1278/usr/bin/python3.10 -m subiquity.cmd.server --use-os-prober --storage-version=2 --postinst-hooks-dir=/snap/ubuntu-desktop-installer/1278/etc/subiquity/postinst.d
root        1172    1168  0 Dec28 ?        00:00:04 python3 /snap/ubuntu-desktop-installer/1278/usr/bin/cloud-init status --wait
huazhou     3125    3002  0 00:45 ?        00:00:00 sh -c 'bash'  -c 'ps -eaf | grep python' 2>&1
huazhou     3126    3125  0 00:45 ?        00:00:00 bash -c ps -eaf | grep python
huazhou     3128    3126  0 00:45 ?        00:00:00 grep python
  • Process with PID=1:
ps -fp 1
UID          PID    PPID  C STIME TTY          TIME CMD
root           1       0  0 Dec28 ?        00:00:00 /sbin/init
  • All processes owned by a user:
ps -fu $USER
UID          PID    PPID  C STIME TTY          TIME CMD
huazhou      375     374  0 Dec28 pts/0    00:00:00 -bash
huazhou      410       1  0 Dec28 ?        00:00:00 /lib/systemd/systemd --user
huazhou      411     410  0 Dec28 ?        00:00:00 (sd-pam)
huazhou      416     376  0 Dec28 pts/1    00:00:00 -bash
huazhou      451     215  0 Dec28 ?        00:00:12 /usr/lib/rstudio-server/bin/rsession -u huazhou --session-use-secure-cookies 0 --session-root-path / --session-same-site 0 --session-use-file-storage 1 --launcher-token ACC6AAAA --r-restore-workspace 2 --r-run-rprofile 2
huazhou     2185     451  0 00:23 pts/2    00:00:00 bash --login --posix
huazhou     2974     451  0 00:45 ?        00:00:00 /bin/bash /opt/quarto/bin/quarto preview linux.qmd --to html --no-watch-inputs --no-browse
huazhou     2982    2974 71 00:45 ?        00:00:00 /opt/quarto/bin/tools/deno-x86_64-unknown-linux-gnu/deno run --unstable --no-config --cached-only --allow-read --allow-write --allow-run --allow-env --allow-net --allow-ffi --no-check --importmap=/opt/quarto/bin/vendor/import_map.json /opt/quarto/bin/quarto.js preview linux.qmd --to html --no-watch-inputs --no-browse
huazhou     3002    2982  0 00:45 ?        00:00:00 /usr/lib/R/bin/exec/R --no-echo --no-restore --file=/opt/quarto/share/rmd/rmd.R
huazhou     3131    3002  0 00:45 ?        00:00:00 sh -c 'bash'  -c 'ps -fu $USER' 2>&1
huazhou     3132    3131  0 00:45 ?        00:00:00 ps -fu huazhou

8.3 Kill processes

  • Kill process with PID=1001:
```{bash}
#| eval: false
kill 1001
```
  • Kill all R processes.
```{bash}
#| eval: false
killall -r R
```
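
To see the life cycle of a process end to end, the sketch below starts a harmless background job, checks that it exists, kills it, and inspects the exit status. The sleep command stands in for a real long-running program.

```bash
# Start a long-running job in the background; $! holds its PID.
sleep 300 &
pid=$!

# kill -0 sends no signal; it only tests whether the process exists.
if kill -0 "$pid" 2>/dev/null; then alive_before="yes"; else alive_before="no"; fi

# Plain kill sends SIGTERM (signal 15), a polite request to terminate.
kill "$pid"
# wait collects the job; a job ended by a signal reports 128 + signal number.
exit_status=0
wait "$pid" 2>/dev/null || exit_status=$?

if kill -0 "$pid" 2>/dev/null; then alive_after="yes"; else alive_after="no"; fi
echo "before: $alive_before, after: $alive_after, status: $exit_status"
```

If a process ignores SIGTERM, kill -9 [PID] sends SIGKILL, which cannot be caught; use it as a last resort because the process gets no chance to clean up.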

8.4 top

  • top prints realtime process information (very useful).
```{bash}
#| eval: false
top
```

  • Exit the top program by pressing the q key.

9 Secure shell (SSH)

9.1 SSH

SSH (secure shell) is the dominant cryptographic network protocol for secure network connection via an insecure network.

  • On Linux or Mac Terminal, access a Linux machine by
```{bash}
#| eval: false
ssh [USERNAME]@[IP_ADDRESS]
```

Replace [USERNAME] above by your account user name on the Linux machine and [IP_ADDRESS] by the machine’s IP address. For example, to connect to the Hoffman2 cluster at UCLA:

```{bash}
#| eval: false
ssh huazhou@hoffman2.idre.ucla.edu
```
  • For Windows users, there are at least three ways: (1) (recommended) Git Bash, which is included in Git for Windows, (2) (not recommended) the PuTTY program (free), or (3) (highly recommended) use WSL for Windows to install a full-fledged Linux system within Windows.

9.2 Advantages of keys over password

  • Key authentication is more secure than password. Most passwords are weak.

  • Script or a program may need to systematically SSH into other machines.

  • Log into multiple machines using the same key.

  • Seamless use of many services: Git/GitHub, AWS or Google cloud service, parallel computing on multiple hosts, Travis CI (continuous integration) etc.

  • Many servers only allow key authentication and do not accept password authentication.

9.3 Key authentication

  • Public key. Put on the machine(s) you want to log in to.

  • Private key. Put on your own computer. Consider this as the actual key in your pocket; never give private keys to others.

  • Messages from the server to your computer are encrypted with your public key. They can only be decrypted using your private key.

  • Messages from your computer to the server are signed with your private key (digital signatures) and can be verified by anyone who has your public key (authentication).

9.4 Steps to generate keys

  • On Linux, Mac, or Windows Git Bash, to generate a key pair:
ssh-keygen -t rsa -f ~/.ssh/[KEY_FILENAME] -C [USERNAME]
    • [KEY_FILENAME] is the name that you want to use for your SSH key files. For example, a filename of id_rsa generates a private key file named id_rsa and a public key file named id_rsa.pub.

    • [USERNAME] is the user for whom you will apply this SSH key.

    • Use an (optional) passphrase different from your password.

  • Set correct permissions on the .ssh folder and key files.
    • The permission for the ~/.ssh folder should be 700 (drwx------).
    • The permission of the private key ~/.ssh/id_rsa should be 600 (-rw-------).
    • The permission of the public key ~/.ssh/id_rsa.pub should be 644 (-rw-r--r--).
chmod 700 ~/.ssh
chmod 600 ~/.ssh/[KEY_FILENAME]
chmod 644 ~/.ssh/[KEY_FILENAME].pub
Note that Windows is different: it doesn't allow changing these permissions.
  • Append the public key to the ~/.ssh/authorized_keys file of any Linux machine we want to SSH to, e.g.,
ssh-copy-id -i ~/.ssh/[KEY_FILENAME] [USERNAME]@[IP_ADDRESS]

Make sure the permission of the authorized_keys file is 600 (-rw-------).

  • Test your new key.
ssh -i ~/.ssh/[KEY_FILENAME] [USERNAME]@[IP_ADDRESS]
  • From now on, you don’t need a password each time you connect from your machine to the teaching server.

  • If you set a passphrase when generating keys, you’ll be prompted for the passphrase each time the private key is used. Avoid repeatedly entering the passphrase by using ssh-agent on Linux/Mac or Pageant on Windows.

  • The same key pair can be used between any two machines. We don’t need to regenerate keys for each new connection.
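
The permission step is a common source of login failures, so it can help to verify the result. The sketch below applies the three chmod commands to empty placeholder files in a throwaway directory (an assumption made so that the real ~/.ssh is not touched) and reads the modes back with ls. It assumes Linux or MacOS; as noted above, Windows handles permissions differently.

```bash
# Work in a throwaway directory instead of the real ~/.ssh.
demo_home=$(mktemp -d)
mkdir "$demo_home/.ssh"
touch "$demo_home/.ssh/id_rsa" "$demo_home/.ssh/id_rsa.pub"

chmod 700 "$demo_home/.ssh"              # owner only: read, write, enter
chmod 600 "$demo_home/.ssh/id_rsa"       # owner only: read, write
chmod 644 "$demo_home/.ssh/id_rsa.pub"   # owner writes, everyone reads

# The first 10 characters of ls -l give the file type and permission bits.
dir_mode=$(ls -ld "$demo_home/.ssh" | cut -c1-10)
private_mode=$(ls -l "$demo_home/.ssh/id_rsa" | cut -c1-10)
public_mode=$(ls -l "$demo_home/.ssh/id_rsa.pub" | cut -c1-10)

echo "$dir_mode $private_mode $public_mode"
rm -r "$demo_home"
```

On a real setup, running ls -ld ~/.ssh and ls -l ~/.ssh gives the same kind of listing to compare against the expected modes.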

9.5 Transfer files between machines

  • scp securely transfers files between machines using SSH.
## copy file from local to remote
scp [LOCALFILE] [USERNAME]@[IP_ADDRESS]:/[PATH_TO_FOLDER]
## copy file from remote to local
scp [USERNAME]@[IP_ADDRESS]:/[PATH_TO_FILE] [PATH_TO_LOCAL_FOLDER]
  • sftp is FTP via SSH.

  • Globus is a GUI program for securely transferring files between machines. To use Globus, go to https://www.globus.org/ and log in through UCLA by selecting UCLA as your existing organizational login. Then download the Globus Connect Personal software and set your laptop as an endpoint. Very detailed instructions can be found at https://www.hoffman2.idre.ucla.edu/file-transfer/globus/.

  • GUIs for Windows (WinSCP) or Mac (Cyberduck).

  • You can even use RStudio to upload files to a remote machine with RStudio Server installed.

  • (Preferred way) Use a version control system (git, svn, cvs, …) to sync project files between different machines and systems.

9.6 Line breaks in text files

  • Windows uses a pair of CR and LF for line breaks.

  • Linux/Unix uses an LF character only.

  • MacOS X also uses a single LF character. But old Mac OS used a single CR character for line breaks.

  • If transferred in binary mode (bit by bit) between OSs, a text file can look like a mess.

  • Most transfer programs automatically switch to text mode when transferring text files and perform conversion of line breaks between different OSs; but I used to run into problems using WinSCP. Sometimes you have to tell WinSCP explicitly a text file is being transferred.
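
When a text file does arrive with Windows line breaks, the stray CR characters can be detected and removed with tr. This is a minimal sketch on a two-line file created for the purpose; the file names are temporary and illustrative.

```bash
crlf_file=$(mktemp)
lf_file=$(mktemp)

# Write two lines with Windows-style CR+LF line breaks.
printf 'alpha\r\nbeta\r\n' > "$crlf_file"

# Count CR characters: tr -cd deletes everything except CR.
cr_count=$(tr -cd '\r' < "$crlf_file" | wc -c | tr -d ' ')

# Convert to Unix line breaks by deleting every CR.
tr -d '\r' < "$crlf_file" > "$lf_file"

size_before=$(wc -c < "$crlf_file" | tr -d ' ')
size_after=$(wc -c < "$lf_file" | tr -d ' ')
echo "$cr_count $size_before $size_after"   # 2 13 11
rm -f "$crlf_file" "$lf_file"
```

Deleting every CR is appropriate for ordinary text files; it should not be applied to binary files.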

10 Run R in Linux

10.1 Interactive mode

  • Start R in the interactive mode by typing R in shell.

  • Then run R script by

source("script.R")

10.2 Batch mode

  • Demo script meanEst.R implements a (terrible) estimator of mean \[ {\widehat \mu}_n = \frac{\sum_{i=1}^n x_i 1_{i \text{ is prime}}}{\sum_{i=1}^n 1_{i \text{ is prime}}}. \]
cat meanEst.R
## check if a given integer is prime
isPrime = function(n) {
  if (n <= 3) {
    return (TRUE)
  }
  if (any((n %% 2:floor(sqrt(n))) == 0)) {
    return (FALSE)
  }
  return (TRUE)
}

## estimate mean only using observation with prime indices
estMeanPrimes = function (x) {
  n = length(x)
  ind = sapply(1:n, isPrime)
  return (mean(x[ind]))
}

print(estMeanPrimes(rnorm(100000)))
  • To run your R code non-interactively aka in batch mode, we have at least two options:
# default output to meanEst.Rout
R CMD BATCH meanEst.R

or

# output to stdout
Rscript meanEst.R
  • Typically automate batch calls using a scripting language, e.g., Python, Perl, and shell script.

10.3 Pass arguments to R scripts

  • Specify arguments in R CMD BATCH:
R CMD BATCH '--args mu=1 sig=2 kap=3' script.R
  • Specify arguments in Rscript:
Rscript script.R mu=1 sig=2 kap=3
  • Parse command line arguments using magic formula
for (arg in commandArgs(TRUE)) {
  eval(parse(text=arg))
}

in the R script. After running the above code, all command line arguments will be available as variables in the global environment.

  • To understand the magic formula commandArgs, run R by:
R '--args mu=1 sig=2 kap=3'

and then issue commands in R

commandArgs()
commandArgs(TRUE)
  • Understand the magic formula parse and eval:
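rm(list = ls())            # start from an empty workspace
print(x)                   # error: object 'x' not found
parse(text = "x=3")        # returns an unevaluated expression; x is still undefined
eval(parse(text = "x=3"))  # evaluates the expression, i.e., assigns x = 3
print(x)                   # now prints [1] 3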
  • runSim.R has components: (1) command argument parser, (2) method implementation, (3) data generator with unspecified parameter n, and (4) estimation based on generated data.
## parsing command arguments
for (arg in commandArgs(TRUE)) {
  eval(parse(text=arg))
}

## check if a given integer is prime
isPrime = function(n) {
  if (n <= 1) {
    return (FALSE)
  }
  if (n <= 3) {
    return (TRUE)
  }
  if (any((n %% 2:floor(sqrt(n))) == 0)) {
    return (FALSE)
  }
  return (TRUE)
}

## estimate mean only using observation with prime indices
estMeanPrimes = function (x) {
  n = length(x)
  ind = sapply(1:n, isPrime)
  return (mean(x[ind]))
}

# simulate data
x = rnorm(n)

# estimate mean
estMeanPrimes(x)
  • Call runSim.R with sample size n=100:
R CMD BATCH '--args n=100' runSim.R

or

Rscript runSim.R n=100
[1] -0.001304939

10.4 Run long jobs

  • Many statistical computing tasks take a long time: simulation, MCMC, etc. If we log out of Linux while a job is still running, the job is killed.

  • The nohup command runs a program immune to hangups and, when standard output is a terminal, writes output to nohup.out by default. Logging out will not kill the process; we can log in later to check status and results.

  • nohup is part of the POSIX standard and is thus available on both Linux and macOS.

  • Run runSim.R in the background and write output to nohup.out:

nohup Rscript runSim.R n=100 &
[1] -0.1542147

The & at the end of the command instructs Linux to run this command in the background, so we regain control of the terminal immediately.
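
As a minimal sketch of this workflow, the commands below start a job under nohup with output redirected to a log file, record its process ID via $!, and check on it later. The job is a stand-in (sleep followed by echo) so that the block runs without R, and the file name job.log is made up; in practice the command would be something like Rscript runSim.R n=100.

```shell
# stand-in for a long job; in practice e.g. Rscript runSim.R n=100
nohup sh -c 'sleep 1; echo done' > job.log 2>&1 &
job_pid=$!   # process ID of the most recent background job
echo "started job with PID $job_pid"

# kill -0 sends no signal; it only tests whether the process still exists
if kill -0 "$job_pid" 2>/dev/null; then
  echo "job $job_pid is still running"
fi

# block until the job finishes, then look at its output
wait "$job_pid"
cat job.log
```

Redirecting to an explicit log file instead of relying on nohup.out keeps the outputs of several jobs from being mixed into one file.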

10.5 screen

  • screen is another popular utility for keeping jobs alive after logging out, but it is not installed by default on many systems.

  • Typical workflow using screen:

    1. Access remote server using ssh.

    2. Start jobs in batch mode.

    3. Detach jobs.

    4. Exit from server, wait for jobs to finish.

    5. Access remote server using ssh.

    6. Re-attach jobs, check on progress, get results, etc.

10.6 Use R to call R

R in conjunction with nohup (or screen) can be used to orchestrate a large simulation study.

  • It can be more elegant, transparent, and robust to parallelize jobs corresponding to different scenarios (e.g., different generative models) outside of the code used to do statistical computation.

  • We consider a simulation study in R but the same approach could be used with code written in Julia, Matlab, Python, etc.

  • Python in many ways makes a better glue.

  • Suppose we have

    • runSim.R which runs a simulation based on command line argument n.
    • A large collection of n values that we want to use in our simulation study.
    • Access to a server with 128 cores.

  • How to parallelize the job?

  • Option 1: manually call runSim.R for each setting.

  • Option 2 (smarter): automate calls using R and nohup.

  • Let’s demonstrate using the script autoSim.R

cat autoSim.R
# autoSim.R

nVals <- seq(100, 1000, by=100)
for (n in nVals) {
  oFile <- paste("n", n, ".txt", sep="")
  sysCall <- paste("nohup Rscript runSim.R n=", n, " > ", oFile, sep="")
  system(sysCall, wait = FALSE)
  print(paste("sysCall=", sysCall, sep=""))
}

Note that when we call a bash command using the system function in R, we set the optional argument wait = FALSE so that the jobs run in parallel rather than one after another.

Rscript autoSim.R
[1] "sysCall=nohup Rscript runSim.R n=100 > n100.txt"
[1] "sysCall=nohup Rscript runSim.R n=200 > n200.txt"
[1] "sysCall=nohup Rscript runSim.R n=300 > n300.txt"
[1] "sysCall=nohup Rscript runSim.R n=400 > n400.txt"
[1] "sysCall=nohup Rscript runSim.R n=500 > n500.txt"
[1] "sysCall=nohup Rscript runSim.R n=600 > n600.txt"
[1] "sysCall=nohup Rscript runSim.R n=700 > n700.txt"
[1] "sysCall=nohup Rscript runSim.R n=800 > n800.txt"
[1] "sysCall=nohup Rscript runSim.R n=900 > n900.txt"
[1] "sysCall=nohup Rscript runSim.R n=1000 > n1000.txt"
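
For comparison, the same fan-out pattern can be sketched directly in the shell, where a trailing & plays the role of wait = FALSE and the wait builtin blocks until all background jobs finish. This is only an illustration: run_job is a hypothetical stand-in (it just sleeps and echoes) so that the block runs without R, and the scratch directory outdir is likewise an assumption. In practice the function body would be the Rscript runSim.R call.

```shell
# hypothetical stand-in for one simulation run, so the sketch runs without R;
# in practice the body would be: Rscript runSim.R n="$1"
run_job() {
  sleep 1
  echo "finished n=$1"
}

outdir=$(mktemp -d)   # scratch directory for the output files (assumption)

for n in 100 200 300; do
  # trailing & sends the job to the background, like wait = FALSE in system()
  run_job "$n" > "$outdir/n$n.txt" 2>&1 &
done

wait   # block until every background job has finished
cat "$outdir"/n*.txt
```

Because the three jobs run concurrently, the loop takes about one second rather than three.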
  • Now we just need to write a script to collect results from the output files.

  • Later we will learn how to coordinate large scale computation on the UCLA Hoffman2 cluster, using Linux and R scripting.
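
One possible shape for such a collection script is sketched below. It assumes each output file is named n<size>.txt and holds a single line of the form "[1] <value>", as printed by Rscript. To keep the block self-contained it first writes two placeholder files into a scratch directory; those numbers are made up and are not simulation results. The names resdir and summary.csv are also assumptions.

```shell
# demo setup: placeholder files mimicking Rscript output "[1] <value>";
# the numbers are made up and are NOT real simulation results
resdir=$(mktemp -d)
echo "[1] 0.25" > "$resdir/n100.txt"
echo "[1] -0.5" > "$resdir/n200.txt"

# collect one CSV row per output file, with n parsed from the file name
summary="$resdir/summary.csv"
echo "n,estimate" > "$summary"
for f in "$resdir"/n*.txt; do
  n=$(basename "$f" .txt)   # e.g. n100
  n=${n#n}                  # strip the leading "n", leaving 100
  est=$(awk '$1 == "[1]" { print $2 }' "$f")
  echo "$n,$est" >> "$summary"
done
cat "$summary"
```

The resulting CSV can then be read back into R, for example with read.csv, for plotting or tabulation.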

11 Some other Linux commands

  • Log out of Linux: exit or logout or ctrl+d.

  • Clear screen: clear.